Research Question: Which Census Tracts have structural, long-term increases in bus demand that communicates Philly needs permanent service expansion in the future?
Method: Spatio-temporal panel analysis using Negative Binomial regression with temporal lags, spatial lags, and tract-level fixed effects.
Data Scope: 2014-2023 SEPTA bus ridership across census tracts in Philadelphia region (5 counties). Restricted to fall season unfortunately, but this is actually one of the restrictions Dr. Delmelle was talking about addressing in the project, so it’s fine if it’s not thorough, she’s really just looking for our understanding of what we’re doing and what limitations we face with real-world data! :D
Strong temporal autocorrelation indicates lagged predictors could be critical.
Spatial clustering indicates need for spatial lag variables.
Weekday recovery differs substantially from weekend patterns.
County-level heterogeneity suggests fixed effects are required.
Setup and Initial Data Loading
Code
# Load libraries.library(tidyverse)library(lubridate)library(sf)library(scales)library(patchwork)library(viridis)library(kableExtra)library(corrplot)library(ggridges)library(knitr)# Disable scientific notation.options(scipen =999)# Consistent theme.theme_set(theme_minimal(base_size =12))# Download viz when knitting.opts_chunk$set(fig.path ="figures/", dev ="png")
Code
# Core panel data for spatio-temporal analysis.df_raw <-read.csv("data/septa/Bus_Ridership_by_Census_Tract.csv")# Initial data structure and dimensions.cat("Initial dataset dimensions:\n")
Initial dataset dimensions:
Code
print(dim(df_raw))
[1] 17891 9
Code
# First few rows to understand structure.head(df_raw) %>%kable() %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"))
# Cleaned dataset with temporal and categorical features.df_clean <- df_raw %>%# Numeric year from Season column for time series analysis.mutate(Year =as.numeric(str_extract(Season, "\\d{4}"))) %>%# Remove rows with missing ridership counts that mess up aggregations.filter(!is.na(On_), !is.na(Off_)) %>%# Binary indicator for weekday vs. weekend patterns.mutate(Is_Weekday =ifelse(Day_of_Week =="Weekday", 1, 0),Day_Type =factor(Day_of_Week, levels =c("Weekday", "Weekend", "Sunday")) ) %>%# Pandemic period indicator for break analysis.mutate(Period =case_when( Year <2020~"Pre-Pandemic", Year ==2020~"Pandemic", Year >2020~"Post-Pandemic" ),Period =factor(Period, levels =c("Pre-Pandemic", "Pandemic", "Post-Pandemic")) ) %>%# Convert Census Tract ID to character for joining.mutate(Census_Tract_ID =as.character(Census_Tract_ID))# Summary stats of cleaned data showing key distributions.cat("Cleaned dataset dimensions:\n")
# Check temporal coverage across all years in panel.df_clean %>%count(Year, name ="Observations") %>%kable() %>%kable_styling(bootstrap_options =c("striped", "hover"))
Year
Observations
2014
852
2015
852
2016
852
2017
2556
2018
2556
2019
2556
2021
2556
2022
2555
2023
2556
Code
# Check zero-ridership tracts which may indicate service gaps.zero_ridership <- df_clean %>%filter(On_ ==0& Off_ ==0) %>%count(Year, name ="Zero_Ridership_Tracts")zero_ridership %>%kable() %>%kable_styling(bootstrap_options =c("striped", "hover"))
Year
Zero_Ridership_Tracts
2014
54
2015
38
2016
42
2017
246
2018
226
2019
197
2021
246
2022
166
2023
170
Code
# Percentage of zero observations per year.total_obs <- df_clean %>%count(Year, name ="Total")zero_pct <- zero_ridership %>%left_join(total_obs, by ="Year") %>%mutate(Percent_Zero =round(Zero_Ridership_Tracts / Total *100, 2))zero_pct %>%select(Year, Percent_Zero) %>%kable() %>%kable_styling(bootstrap_options =c("striped", "hover"))
Year
Percent_Zero
2014
6.34
2015
4.46
2016
4.93
2017
9.62
2018
8.84
2019
7.71
2021
9.62
2022
6.50
2023
6.65
Code
# Geographic coverage across five-county SEPTA service area.df_clean %>%count(County_Name, name ="Tract_Observations") %>%arrange(desc(Tract_Observations)) %>%kable() %>%kable_styling(bootstrap_options =c("striped", "hover"))
County_Name
Tract_Observations
Philadelphia County
8442
Montgomery County
3549
Delaware County
2961
Bucks County
1658
Chester County
1197
Mercer County
42
New Castle County
42
Code
# Unique tracts per county to understand spatial extent.df_clean %>%group_by(County_Name) %>%summarize(Unique_Tracts =n_distinct(Census_Tract_ID)) %>%arrange(desc(Unique_Tracts)) %>%kable() %>%kable_styling(bootstrap_options =c("striped", "hover"))
County_Name
Unique_Tracts
Philadelphia County
402
Montgomery County
169
Delaware County
141
Bucks County
79
Chester County
57
Mercer County
2
New Castle County
2
Distribution Analysis
Ridership Distribution Histograms
Code
# Pivot On_ and Off_ counts into long format for facet.df_long <- df_clean %>%select(Census_Tract_ID, Year, Day_of_Week, On_, Off_) %>%pivot_longer(cols =c(On_, Off_),names_to ="Ridership_Type_Raw",values_to ="Count" ) %>%mutate(Ridership_Type =case_when( Ridership_Type_Raw =="On_"~"Boardings (On)", Ridership_Type_Raw =="Off_"~"Deboardings (Off)" ) )# Log transformation column for distribution comparison.df_hist_data <- df_long %>%mutate(Log_Count =log10(Count +1),Scale =factor("Raw Count", levels =c("Raw Count", "log(Count + 1)")) )# Duplicate data with log scale for second column facet.df_log_data <- df_hist_data %>%mutate(Count = Log_Count,Scale =factor("log(Count + 1)", levels =c("Raw Count", "log(Count + 1)")) )# Combine raw and log-transformed data for viz.df_final_hist <-bind_rows(df_hist_data, df_log_data) %>%filter(!is.na(Count))# 2x2 faceted histogram showing raw and log-transformed distributions.dist_plot <-ggplot(df_final_hist, aes(x = Count, fill = Ridership_Type)) +geom_histogram(bins =50, position ="identity", alpha =0.6) +scale_x_continuous(labels = comma) +scale_y_continuous(labels = comma) +scale_fill_manual(values =c("#2E86AB", "#A23B72")) +labs(title ="Ridership Distribution: Raw vs. Log-Transformed Counts",subtitle ="Right-Skew and Overdispersion = Negative Binomial > Poisson",x ="Ridership Count",y ="Frequency (Number of Observations)",fill ="Type",caption ="Data: SEPTA Bus Ridership 2014-2023 | All Census Tracts" ) +facet_grid(Ridership_Type ~ Scale, scales ="free") +theme_minimal(base_size =11) +theme(legend.position ="bottom",plot.title =element_text(face ="bold", size =14),strip.text =element_text(face ="bold") )# Distribution plot demonstrating need for count model.dist_plot
Overdispersion Statistics
Code
# Mean and variance for boardings to quantify overdispersion.overdispersion_stats <- df_clean %>%summarize(Mean_On =mean(On_, na.rm =TRUE),Variance_On =var(On_, na.rm =TRUE),Dispersion_Ratio = Variance_On / Mean_On,Mean_Off =mean(Off_, na.rm =TRUE),Variance_Off =var(Off_, na.rm =TRUE),Dispersion_Ratio_Off = Variance_Off / Mean_Off )cat(sprintf("Boardings (On_):\n"))
Overdispersion stats above (Variance / Mean) Dispersion ratio > 1 means overdispersion. Negative Binomial model handles this variance structure better than Poisson.
# Top 20 tracts with highest growth and steepest decline.top_gainers <- tract_recovery %>%filter(Y2019 >=100) %>%arrange(desc(Recovery_Ratio)) %>%head(20) %>%mutate(Type ="Top Gainers")top_decliners <- tract_recovery %>%filter(Y2019 >=100) %>%arrange(Recovery_Ratio) %>%head(20) %>%mutate(Type ="Steepest Declines")# Combine and viz winners vs. losers.winners_losers <-bind_rows(top_gainers, top_decliners)# Top 10 highest growth tracts in 2023 vs. 2019, minimum 100 boardings in 2019.top_gainers %>%head(10) %>%select(Census_Tract_ID, County_Name, Y2019, Y2023, Recovery_Ratio) %>%kable(digits =2, col.names =c("Tract ID", "County", "2019", "2023", "Recovery Ratio")) %>%kable_styling(bootstrap_options =c("striped", "hover"))
Tract ID
County
2019
2023
Recovery Ratio
42101000102
Philadelphia County
325
627
1.93
42101018801
Philadelphia County
736
1360
1.85
42101021900
Philadelphia County
101
164
1.62
42101034501
Philadelphia County
104
157
1.51
42101015300
Philadelphia County
326
457
1.40
42101021500
Philadelphia County
213
284
1.33
42029302500
Chester County
145
188
1.30
42101034502
Philadelphia County
837
1047
1.25
42101017300
Philadelphia County
998
1213
1.22
42101036000
Philadelphia County
741
869
1.17
Code
# Top 10 steepest decline tracts in 2023 vs. 2019, minimum 100 boardings in 2019.top_decliners %>%head(10) %>%select(Census_Tract_ID, County_Name, Y2019, Y2023, Recovery_Ratio) %>%kable(digits =2, col.names =c("Tract ID", "County", "2019", "2023", "Recovery Ratio")) %>%kable_styling(bootstrap_options =c("striped", "hover"))
Tract ID
County
2019
2023
Recovery Ratio
42091203207
Montgomery County
152
25
0.16
42101038400
Philadelphia County
170
31
0.18
10003010105
New Castle County
195
38
0.19
42029302706
Chester County
123
24
0.20
42017101808
Bucks County
102
27
0.26
42101034600
Philadelphia County
761
217
0.29
42101037800
Philadelphia County
838
247
0.29
42091203000
Montgomery County
108
36
0.33
42101021700
Philadelphia County
716
251
0.35
42091205810
Montgomery County
485
179
0.37
Brief Feature Engineering
Temporal Lag Variable
Code
# Temporal lag feature for all tracts and years.df_with_lag <- df_clean %>%filter(Day_of_Week =="Weekday") %>%arrange(Census_Tract_ID, Year) %>%group_by(Census_Tract_ID) %>%mutate(On_Lag1 =lag(On_, n =1, order_by = Year),Off_Lag1 =lag(Off_, n =1, order_by = Year) ) %>%ungroup()# Check missing lag values in first year for each tract.lag_missing <- df_with_lag %>%summarize(Total_Obs =n(),Missing_Lag =sum(is.na(On_Lag1)),Pct_Missing = Missing_Lag / Total_Obs *100 )
Code
# Temporal lag variable coverage.lag_missing %>%kable(digits =1) %>%kable_styling(bootstrap_options =c("striped", "hover"))
Total_Obs
Missing_Lag
Pct_Missing
7668
852
11.1
Code
# First few rows with lag variable.df_with_lag %>%select(Census_Tract_ID, Year, On_, On_Lag1) %>%head(15) %>%kable() %>%kable_styling(bootstrap_options =c("striped", "hover"))
# Check for balanced panel structure.balanced_pct <-sum(tract_stats$Full_Panel) /nrow(tract_stats) *100cat(sprintf("\nPercentage of tracts with complete 10-year panel: %.1f%%\n", balanced_pct))
Percentage of tracts with complete 10-year panel: 0.0%
Coefficient of Variation
Code
# Coefficient of variation to identify stable vs. volatile tracts.cv_plot <-ggplot(tract_stats %>%filter(Mean_Boardings >10), aes(x = Mean_Boardings, y = CV_Boardings)) +geom_hex(bins =40) +scale_fill_viridis_c(option ="magma", trans ="log10", labels = comma) +scale_x_log10(labels = comma) +scale_y_log10() +geom_hline(yintercept =1, linetype ="dashed", color ="white") +labs(title ="Tract Volatility: Coefficient of Variation vs. Mean Ridership",subtitle ="Higher CV means less predictable ridership patterns requiring closer examination.",x ="Mean Weekday Boardings (2014-2023) [Log Scale]",y ="Coefficient of Variation (SD / Mean) [Log Scale]",fill ="Tract Count",caption ="Data: SEPTA Bus Ridership | CV = 1 Ref Line" ) +theme_minimal(base_size =12) +theme(plot.title =element_text(face ="bold", size =14) )cv_plot
Data Quality and Missingness
Zero-Ridership Pattern
Code
# Percent of zero-ridership observations by year and county.zero_patterns <- df_clean %>%mutate(Is_Zero =ifelse(On_ ==0& Off_ ==0, 1, 0)) %>%group_by(Year, County_Name) %>%summarize(Total_Obs =n(),Zero_Count =sum(Is_Zero),Zero_Pct = Zero_Count / Total_Obs *100,.groups ="drop" )# Zero-ridership patterns across counties and time.zero_heatmap <-ggplot(zero_patterns, aes(x =factor(Year), y = County_Name, fill = Zero_Pct)) +geom_tile(color ="white", linewidth =0.5) +scale_fill_gradient2(low ="#2ECC71", mid ="#F39C12", high ="#E74C3C",midpoint =10, name ="% Zero Ridership") +geom_text(aes(label =sprintf("%.1f%%", Zero_Pct)), size =3) +labs(title ="Zero-Ridership Patterns by County and Year",subtitle ="Percentage of tract-observations with no recorded bus boardings or deboardings.",x ="Year",y ="County",caption ="Data: SEPTA Bus Ridership" ) +theme_minimal(base_size =12) +theme(plot.title =element_text(face ="bold", size =14),axis.text.x =element_text(angle =45, hjust =1) )zero_heatmap
Service Coverage
Code
# Tracts with consistent service vs. sporadic coverage.service_coverage <- df_clean %>%mutate(Has_Service =ifelse(On_ >0| Off_ >0, 1, 0)) %>%group_by(Census_Tract_ID, County_Name) %>%summarize(Years_With_Service =sum(Has_Service),Total_Years =n_distinct(Year),Coverage_Rate = Years_With_Service / Total_Years *100,.groups ="drop" ) %>%mutate(Coverage_Category =case_when( Coverage_Rate ==100~"Consistent Service (100%)", Coverage_Rate >=75~"Mostly Served (75-99%)", Coverage_Rate >=50~"Intermittent Service (50-74%)", Coverage_Rate >0~"Minimal Service (1-49%)",TRUE~"No Service" ) )# Service coverage distribution.service_coverage %>%count(Coverage_Category, name ="Tract_Count") %>%mutate(Percentage =round(Tract_Count /sum(Tract_Count) *100, 1)) %>%kable() %>%kable_styling(bootstrap_options =c("striped", "hover"))
Coverage_Category
Tract_Count
Percentage
Consistent Service (100%)
16
1.9
Intermittent Service (50-74%)
3
0.4
Minimal Service (1-49%)
6
0.7
Mostly Served (75-99%)
827
97.1
Cross-Sectional Variation
Distribution by Day Type
Code
# Violin plot comparing distributions across day types.day_violin <- df_clean %>%filter(On_ >0) %>%ggplot(aes(x = Day_of_Week, y =log10(On_ +1), fill = Day_of_Week)) +geom_violin(alpha =0.7, draw_quantiles =c(0.25, 0.5, 0.75)) +geom_boxplot(width =0.1, outlier.alpha =0.3, alpha =0.5) +scale_fill_manual(values =c("Weekday"="#2E86AB", "Weekend"="#A23B72", "Sunday"="#F18F01")) +labs(title ="Ridership Distribution by Day Type",subtitle ="Weekdays show higher and more concentrated ridership patterns.",x ="Day Type",y ="log(Boardings + 1)",fill ="Day Type",caption ="Data: SEPTA Bus Ridership 2014-2023" ) +theme_minimal(base_size =12) +theme(plot.title =element_text(face ="bold", size =14),legend.position ="none" )day_violin
County Comparison Box Plots
Code
# Faceted box plots showing ridership distribution by county over time.county_box <- df_clean %>%filter(Day_of_Week =="Weekday", On_ >0) %>%ggplot(aes(x =factor(Year), y =log10(On_ +1), fill = County_Name)) +geom_boxplot(outlier.alpha =0.2, outlier.size =0.5) +scale_fill_viridis_d(option ="turbo") +labs(title ="Ridership Distribution by County Across Years",subtitle ="Philadelphia County dominates ridership volume with different recovery patterns.",x ="Year",y ="log(Weekday Boardings + 1)",fill ="County",caption ="Data: SEPTA Weekday Bus Ridership" ) +theme_minimal(base_size =11) +theme(plot.title =element_text(face ="bold", size =14),axis.text.x =element_text(angle =45, hjust =1),legend.position ="bottom" ) +facet_wrap(~County_Name, ncol =2, scales ="free_y")county_box
Correlation Analysis
Boardings vs. Deboardings Relationship
Code
# Correlation between boardings and deboardings.on_off_cor <-cor(df_clean$On_, df_clean$Off_, use ="complete.obs")# Scatter plot showing boarding-deboarding relationship.on_off_scatter <- df_clean %>%filter(Day_of_Week =="Weekday", Year ==2023) %>%ggplot(aes(x = On_ +1, y = Off_ +1)) +geom_hex(bins =50) +geom_abline(intercept =0, slope =1, linetype ="dashed", color ="red", linewidth =1) +scale_fill_viridis_c(option ="plasma", trans ="log10", labels = comma) +scale_x_log10(labels = comma) +scale_y_log10(labels = comma) +annotate("text", x =10, y =10000,label =sprintf("Correlation: %.3f", on_off_cor),size =5, color ="white", fontface ="bold") +labs(title ="Boardings vs. Deboardings Correlation (2023 Weekday)",subtitle ="Strong correlation suggests tracts serve both origins and destinations.",x ="Boardings (On_) [Log Scale]",y ="Deboardings (Off_) [Log Scale]",fill ="Density",caption ="Data: SEPTA Weekday Bus Ridership | Diagonal = Perfect Balance" ) +theme_minimal(base_size =12) +theme(plot.title =element_text(face ="bold", size =14) )on_off_scatter
# Check for data leakage ensuring no overlap.cat("\nData Leakage Check:\n")
Data Leakage Check:
Code
overlap <-intersect(paste(df_train$Census_Tract_ID, df_train$Year),paste(df_test$Census_Tract_ID, df_test$Year))# Should be 0.cat(sprintf("Overlapping Observations: %d\n", length(overlap)))
# Load Pennsylvania census tract geometry for spatial visualization.tract_geo <-st_read("data/pa_tracts/pa_tracts.shp")
Reading layer `pa_tracts' from data source
`C:\Users\Tess\Desktop\UPenn\UPenn_FW25\MUSA_5080-401_Public_Policy_Analytics\shark-tank\data\pa_tracts\pa_tracts.shp'
using driver `ESRI Shapefile'
Simple feature collection with 3446 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -80.51985 ymin: 39.7198 xmax: -74.68956 ymax: 42.51607
Geodetic CRS: NAD83
Code
# Define FIPS codes for Philadelphia and surrounding counties.septa_county_fips <-c("017", "029", "045", "091", "101")# Alternative: Philadelphia only.# septa_county_fips <- c("101")# Transform to WGS84 coordinate system and filter to SEPTA service area.final_map_base <- tract_geo %>%st_transform(4326) %>%filter(COUNTYFP %in% septa_county_fips)# Calculate total ridership by census tract for Fall 2023.current_ridership <- df_clean %>%filter(Year ==2023) %>%group_by(Census_Tract_ID) %>%summarize(Final_On_Count =sum(On_),Final_Off_Count =sum(Off_) ) %>%mutate(Census_Tract_ID =as.character(Census_Tract_ID))# Join ridership data to geometry for spatial visualization.tract_map_data <- final_map_base %>%left_join(current_ridership, by =c("GEOID"="Census_Tract_ID"))# Filter to only tracts with recorded ridership data.tract_map_data_with_ridership <- tract_map_data %>%filter(!is.na(Final_On_Count))# Load SEPTA transit route geometry for network visualization.routes <-st_read("data/septa/Transit_Routes_Spring_2025/Transit_Routes.shp")
Reading layer `Transit_Routes' from data source
`C:\Users\Tess\Desktop\UPenn\UPenn_FW25\MUSA_5080-401_Public_Policy_Analytics\shark-tank\data\septa\Transit_Routes_Spring_2025\Transit_Routes.shp'
using driver `ESRI Shapefile'
Simple feature collection with 178 features and 18 fields
Geometry type: MULTILINESTRING
Dimension: XY
Bounding box: xmin: -8441925 ymin: 4837473 xmax: -8321361 ymax: 4914725
Projected CRS: WGS 84 / Pseudo-Mercator
Code
# Create choropleth map of boarding counts using log scale transformation.spatial_plot_log_on <-ggplot(tract_map_data_with_ridership) +geom_sf(aes(fill = Final_On_Count +1), color ="#ff4100", linewidth =0.25) +scale_fill_viridis_c(name ="Avg Daily Boardings",trans ="log10",labels = comma,na.value ="gray",option ="mako",direction =-1 ) +labs(title ="Spatial Distribution of Bus Ridership (Fall 2023)",subtitle ="Geographic clustering of log(Rider On Count + 1)." ) +theme_void() +theme(plot.background =element_rect(fill ="#dedbd4", color ="#dedbd4"))# Create choropleth map of alighting counts using log scale transformation.spatial_plot_log_off <-ggplot(tract_map_data_with_ridership) +geom_sf(aes(fill = Final_Off_Count +1), color ="#ff4100", linewidth =0.25) +scale_fill_viridis_c(name ="Avg Daily Alightings",trans ="log10",labels = comma,na.value ="gray",option ="mako",direction =-1 ) +labs(title ="Spatial Distribution of Bus Ridership (Fall 2023)",subtitle ="Geographic clustering of log(Rider Off Count + 1)." ) +theme_void() +theme(plot.background =element_rect(fill ="#dedbd4", color ="#dedbd4"))# Create map showing complete SEPTA route network over census tracts.septa_route_map <-ggplot(tract_map_data_with_ridership) +geom_sf(fill =NA, color ="#ff4100", linewidth =0.25) +geom_sf(data = routes, color ="#778ac5", fill =NA, linewidth =0.5) +scale_color_discrete() +labs(title ="SEPTA Routes",subtitle ="All SEPTA Routes" ) +theme_void() +theme(plot.background =element_rect(fill ="#dedbd4", color ="#dedbd4"))# Combine all three maps vertically with shared legend at bottom.(septa_route_map / spatial_plot_log_on / spatial_plot_log_off) +plot_layout(guides ="collect") &theme(legend.position ="bottom",legend.direction ="horizontal" )
Key Things for Negative Binomial Model
Overdispersion Justification
Variance/Mean Ratio: 2365.82 > 1.
Negative Binomial model handles overdispersion better than Poisson.
Stop-level data (Fall 2023): Granular boarding patterns at individual bus stops.
Route performance (2019-2024): Temporal trends for individual bus routes.
Modal comparison (2019-2024): Bus vs. Regional Rail recovery.
Regional Rail stations: Potential bus-rail transfer points and competition.
Helps identify high-performing routes for service replication, understand bus-rail modal shifts, stop-level hotspot analysis for micro-interventions, check tract-level patterns with finer ridership.
Stop Analysis
Load Stop-Level Data
Code
# Load Fall 2023 bus stop summary with geographic coordinates.df_stops <-read.csv("data/septa/Fall_2023_Stop_Summary_Bus.csv")# Clean and prepare stop data with total ridership calculations.df_stops_clean <- df_stops %>%mutate(# Total boardings across all day types.Total_Boardings = Weekday_On + Saturday_O + Sunday_Ons,Total_Alightings = Weekday_Of + Saturday_1 + Sunday_Off,Total_Activity = Total_Boardings + Total_Alightings,# Dominant day type indicator.Weekday_Dominance = Weekday_On / (Weekday_On + Saturday_O + Sunday_Ons +0.001),# Clean route numbers. Remove non-numeric characters after trying to convert numeric characters.Route_Numeric =as.numeric(ifelse(grepl("^[0-9]+$", Route), Route, NA)) ) %>%# Filter stops with no activity.filter(Total_Activity >0)# Summary stats for stop-level data.cat(sprintf("Total Stops: %s\n", comma(n_distinct(df_stops_clean$Stop_Code))))
# Histogram of weekday boardings distribution across all stops.stop_dist_plot <- df_stops_clean %>%filter(Weekday_On >0) %>%ggplot(aes(x =log10(Weekday_On +1))) +geom_histogram(bins =50, fill ="#3498DB", color ="white", alpha =0.8) +geom_vline(xintercept =log10(median(df_stops_clean$Weekday_On[df_stops_clean$Weekday_On >0]) +1),linetype ="dashed", color ="#E74C3C", linewidth =1) +annotate("text", x =log10(median(df_stops_clean$Weekday_On[df_stops_clean$Weekday_On >0]) +1), y =Inf,label =sprintf("Median = %.0f", median(df_stops_clean$Weekday_On[df_stops_clean$Weekday_On >0])),vjust =2, hjust =1.1, color ="#E74C3C", fontface ="bold") +labs(title ="Distribution of Weekday Boardings Across Bus Stops",subtitle ="High concentration in few stops suggests strategic intervention points.",x ="log(Weekday Boardings + 1)",y ="Number of Stops",caption ="Data: SEPTA Fall 2023 Bus Stop Summary | Positive Boardings Only" ) +theme_minimal(base_size =12) +theme(plot.title =element_text(face ="bold", size =14))stop_dist_plot
Code
# Concentration stats.total_boardings <-sum(df_stops_clean$Weekday_On)top_10pct_threshold <-quantile(df_stops_clean$Weekday_On, 0.90)top_10pct_boardings <-sum(df_stops_clean$Weekday_On[df_stops_clean$Weekday_On >= top_10pct_threshold])concentration_pct <- top_10pct_boardings / total_boardings *100cat(sprintf("Top 10%% of stops account for %.1f%% of all weekday boardings\n", concentration_pct))
Top 10% of stops account for 59.7% of all weekday boardings
Weekday Dominance
Code
# Proportion of activity occurring on weekdays vs. weekends.dominance_plot <- df_stops_clean %>%filter(Total_Boardings >10) %>%ggplot(aes(x = Weekday_Dominance)) +geom_histogram(bins =50, fill ="#9B59B6", color ="white", alpha =0.8) +geom_vline(xintercept =0.5, linetype ="dashed", color ="black", linewidth =1) +annotate("text", x =0.5, y =Inf, label ="Equal Split", vjust =2, hjust =1.1, fontface ="bold") +scale_x_continuous(labels = percent) +labs(title ="Weekday Ridership Dominance at Bus Stops",subtitle ="Most stops are weekday-dominant reflecting commuting patterns.",x ="Proportion of Total Boardings on Weekdays",y ="Number of Stops",caption ="Data: SEPTA Fall 2023 | Stops with 10+ Total Boardings" ) +theme_minimal(base_size =12) +theme(plot.title =element_text(face ="bold", size =14))dominance_plot
# Spatial map of regional rail stations.rr_spatial_plot <- rr_station_balance %>%ggplot() +geom_sf(data = tract_map_data, fill =NA, color ="gray60", linewidth =0.25) +geom_point(aes(x = Longitude, y = Latitude, size = Weekday, color = Station_Type),alpha =0.7 ) +scale_size_continuous(range =c(2, 10), labels = comma, name ="Weekday Activity") +scale_color_manual(values =c("Strong Commuter"="#E74C3C", "Mixed Use"="#F39C12", "All-Day/Leisure"="#27AE60"),name ="Station Type" ) +labs(title ="Regional Rail Station Distribution and Activity (2023)",subtitle ="Point size indicates weekday activity level | Color shows commuter orientation.",x ="Longitude",y ="Latitude",caption ="Data: SEPTA Regional Rail Station Summary" ) +theme_minimal(base_size =12) +theme(plot.title =element_text(face ="bold", size =14),legend.position ="right",panel.grid =element_line(color ="gray90") ) +coord_sf(xlim =c(min_lon, max_lon),ylim =c(min_lat, max_lat) )rr_spatial_plot
Bus-Rail Interactions
Feeder Routes: Bus routes connecting to major rail stations.
Competitive Corridors: Parallel bus and rail service on same routes.
Modal Shift: Changes in bus ridership near rail stations during recovery.
Note: Full analysis requires spatial join of tract centroids to rail station buffers.
Potential Distance to Rail Station Feature?
Tracts farther from rail are more bus-dependent, so there could be a negative coefficient meaning the greater the distance from the rail, the higher the bus ridership. Could identify underserved areas for new rail or BRT investment!
Stop-Route Density Feature?
Stops per square mile by tract could be an option, so we could count count stops per tract and normalize by tract area?
A route diversity FE could be created too, so the number of unique routes serving each tract? Or we could actually create an index which could be better, but I don’t think it’s in the assignment toolbox.
Tracts with more stops/routes likely have higher ridership, so service expansion should prioritize low-density areas.
More Insights
Key Stuff from Contextual SEPTA Data
Stop-Level Patterns
Top 10% of stops account for 59.7% of boardings.
Mean weekday dominance: 52.0%.
Spatial hotspots concentrated in Center City and transit hubs.
Route-Level Recovery
Median recovery ratio: 0.72.
Routes with full recovery: 1.526718.
High variation suggests route-specific interventions needed.
Modal Dynamics
Bus-Rail correlation: 0.969.
Bus recovered faster than Regional Rail post-pandemic.
Modal share shift toward bus indicates changing commute patterns.
Bus-Rail Integration
Strong commuter stations may compete with parallel bus routes.
Mixed-use stations create feeder bus opportunities.
Distance to rail is key predictor for bus dependency.
Implications for Tract-Level Model
Stop density metric enhances tract-level service quality proxy.
Route diversity predicts resilience to service disruptions.